Add __array_ufunc__ to Series / Array #23293
Conversation
Hello @jorisvandenbossche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-07-01 18:48:08 UTC
One question is whether we want to force EA authors to implement an __array_ufunc__ (e.g. by having a default implementation returning NotImplemented).
Would the motivation here be to simplify pandas internal code, so that we can skip any checks for it?
BTW, if you also already had some stuff on this, feel free to push the branch so I can use something of it.
Considering the IntegerArray
@TomAugspurger the first option (use pandas/core/arrays/sparse.py, lines 1450 to 1487 at 5a0c91e)
Codecov Report
@@ Coverage Diff @@
## master #23293 +/- ##
==========================================
- Coverage 92.04% 91.99% -0.05%
==========================================
Files 180 180
Lines 50714 50848 +134
==========================================
+ Hits 46679 46777 +98
- Misses 4035 4071 +36
pandas/core/series.py (Outdated)

    @@ -623,6 +623,29 @@ def view(self, dtype=None):
            return self._constructor(self._values.view(dtype),
                                     index=self.index).__finalize__(self)

        def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
            inputs = tuple(
                x._values if isinstance(x, type(self)) else x
nitpick: you could move type(self) outside this expression, so that it's only evaluated once.
cc @shoyer if you have thoughts here as well.
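The nitpick can be sketched with a standalone toy (the Box class and unwrap_inputs helper are illustrative, not pandas code): hoisting the type(self) lookup into a local means it is evaluated once rather than once per element of the comprehension.

```python
import numpy as np

class Box:
    """Toy wrapper standing in for Series (illustrative only)."""
    def __init__(self, values):
        self._values = np.asarray(values)

def unwrap_inputs(obj, inputs):
    # Evaluate type(obj) once, instead of on every iteration of the
    # comprehension, then unwrap any input of that same type.
    cls = type(obj)
    return tuple(x._values if isinstance(x, cls) else x for x in inputs)

b = Box([1, 2, 3])
unwrapped = unwrap_inputs(b, (b, 10))
```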
…] underlying values are object array -> not always doing the correct thing)
pandas/core/arrays/integer.py (Outdated)

            np.array(x) if isinstance(x, type(self)) else x
            for x in inputs
        )
        return np.array(self).__array_ufunc__(
You probably want to call the public ufunc again here, e.g., return getattr(ufunc, method)(*inputs, **kwargs). Otherwise you don't invoke the dispatching mechanism again.
See NDArrayOperatorsMixin for an example implementation.
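A minimal sketch of this suggestion, modeled on numpy.lib.mixins.NDArrayOperatorsMixin (the WrappedArray class is a toy stand-in, not pandas code): after unwrapping inputs of our own type, re-invoking the public ufunc via getattr(ufunc, method) lets NumPy's dispatch mechanism run again on the unwrapped inputs.

```python
import numpy as np

class WrappedArray:
    """Toy array wrapper (hypothetical, for illustration only)."""

    def __init__(self, values):
        self._values = np.asarray(values)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap our own type, then re-invoke the *public* ufunc so that
        # NumPy's dispatch mechanism runs again on the unwrapped inputs.
        inputs = tuple(
            x._values if isinstance(x, WrappedArray) else x for x in inputs
        )
        result = getattr(ufunc, method)(*inputs, **kwargs)
        return WrappedArray(result)

a = WrappedArray([1, 2, 3])
out = np.add(a, 1)  # dispatches to WrappedArray.__array_ufunc__
```

Because the unwrapped inputs are plain ndarrays and scalars, the re-invoked ufunc does not dispatch back to WrappedArray, so there is no infinite recursion.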
pandas/core/arrays/integer.py (Outdated)

        if (method == '__call__'
                and ufunc.signature is None
                and ufunc.nout == 1):
            # only supports IntegerArray for now
This should check for isinstance(.., IntegerArray)?
pandas/core/arrays/integer.py (Outdated)

        # for binary ops, use our custom dunder methods
        result = ops.maybe_dispatch_ufunc_to_dunder_op(
            self, ufunc, method, *inputs, **kwargs)
        if result is not None:
Is there a reason why you use a different sentinel value None rather than Python's standard NotImplemented?
Not a specific reason, just that I didn't return anything in maybe_dispatch_ufunc_to_dunder_op, which means the result is None (and our own dunder ops will never return None). But I can make it more explicit.
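A sketch of the more explicit variant (this is a simplified stand-in for pandas' ops.maybe_dispatch_ufunc_to_dunder_op, with a made-up ufunc-to-dunder table): returning NotImplemented makes the "dispatch did not happen" case explicit instead of relying on an implicit None.

```python
import numpy as np

# Hypothetical, minimal ufunc-to-dunder mapping for illustration.
_UFUNC_TO_DUNDER = {np.add: '__add__', np.subtract: '__sub__'}

def maybe_dispatch_ufunc_to_dunder_op(obj, ufunc, method, *inputs, **kwargs):
    op = _UFUNC_TO_DUNDER.get(ufunc)
    if method == '__call__' and op is not None and inputs[0] is obj:
        return getattr(obj, op)(inputs[1])
    # Explicit sentinel: the caller can see that dispatch did not happen.
    return NotImplemented

class Num:
    def __init__(self, x):
        self.x = x
    def __add__(self, other):
        return Num(self.x + other)

n = Num(1)
r1 = maybe_dispatch_ufunc_to_dunder_op(n, np.add, '__call__', n, 2)
r2 = maybe_dispatch_ufunc_to_dunder_op(n, np.negative, '__call__', n)
```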
pandas/core/series.py (Outdated)

            ufunc, method, *inputs, **kwargs)
        if result is NotImplemented:
            raise TypeError("The '{0}' operation is not supported for "
                            "dtype {1}.".format(ufunc.__name__, self.dtype))
Instead of all this (checking for __array_ufunc__ and then the branch/error message), you could just reinvoke the public ufunc again (as I suggested for IntegerArray). That would also make me more confident that the proper delegation is happening.
Thanks, that sounds like a better idea :)
pandas/core/series.py (Outdated)

        if result is NotImplemented:
            raise TypeError("The '{0}' operation is not supported for "
                            "dtype {1}.".format(ufunc.__name__, self.dtype))
        return self._constructor(result, index=self.index,
Note that not every ufunc returns one result, so you should either handle or explicitly error for the other cases. See NDArrayOperatorsMixin for how to handle this.
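A hedged sketch of the multiple-output handling, following the approach used by NDArrayOperatorsMixin (the wrap_result helper is hypothetical): a ufunc with nout > 1, such as np.divmod, returns a tuple and each element must be wrapped separately, while in-place methods like ufunc.at return None.

```python
import numpy as np

def wrap_result(constructor, ufunc, result):
    # Following numpy.lib.mixins.NDArrayOperatorsMixin: a ufunc may return
    # a tuple (when ufunc.nout > 1), None (e.g. for ufunc.at), or a single
    # array; each case needs its own handling.
    if isinstance(result, tuple):
        return tuple(constructor(x) for x in result)
    elif result is None:
        return None
    return constructor(result)

arr = np.array([7, 8, 9])
# np.divmod has nout == 2, so the raw result is a 2-tuple of arrays.
quot_rem = wrap_result(list, np.divmod, np.divmod(arr, 2))
single = wrap_result(list, np.add, np.add(arr, 1))
```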
Out of curiosity, do you have the same problem with the current implementation of __array_wrap__?
To answer my own question: yes, at least for ufuncs that return a scalar:
In [33]: s = pd.Series([1, 2, 3])
In [34]: np.add.reduce(s)
Out[34]:
0 6
1 6
2 6
dtype: int64
In [35]: pd.__version__
Out[35]: '0.23.4'
But for ufuncs that return multiple values it seems to work with pandas master (wrapping each return element I suppose).
pandas/core/series.py (Outdated)

    @@ -623,6 +623,29 @@ def view(self, dtype=None):
            return self._constructor(self._values.view(dtype),
                                     index=self.index).__finalize__(self)

        def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
Normally we recommend doing a type check and returning NotImplemented if you see an unknown type. If __array_ufunc__ always tries to do the operation, you can get non-commutative operations, e.g., a + b != b + a.
For example, consider another library like xarray that implements its own version of a pandas.Series. Series + DataArray and DataArray + Series should consistently give either Series, DataArray, or an error (probably the safest choice) -- they shouldn't just return a result of the same type as the first argument.
This might even be an issue for Series + DataFrame or np.add(Series, DataFrame) -- you probably want to delegate to the DataFrame in that case, not the Series.
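The recommended pattern can be sketched as follows, modeled on the NDArrayOperatorsMixin example in the NumPy docs (MySeries and its _HANDLED_TYPES are illustrative, not pandas' actual implementation): unknown input types make __array_ufunc__ return NotImplemented so the other operand gets a chance to handle the operation.

```python
import numbers
import numpy as np

class MySeries:
    """Illustrative container; not pandas' actual implementation."""

    _HANDLED_TYPES = (np.ndarray, numbers.Number, list)

    def __init__(self, values):
        self._values = np.asarray(values)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        for x in inputs:
            # Defer to unknown types so that e.g. a DataArray-like object
            # gets a chance to handle the operation itself; otherwise
            # a + b and b + a could disagree.
            if not isinstance(x, self._HANDLED_TYPES + (MySeries,)):
                return NotImplemented
        inputs = tuple(x._values if isinstance(x, MySeries) else x
                       for x in inputs)
        result = getattr(ufunc, method)(*inputs, **kwargs)
        return MySeries(result)

s = MySeries([1, 2, 3])
ok = np.multiply(s, 2)
unknown = s.__array_ufunc__(np.multiply, '__call__', s, object())
```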
I was thinking: the __array_ufunc__ of the underlying values we call here is already doing the check, so I don't need to do it here again. But indeed, that will probably mess up such cases.
The problem is that for now the __array_ufunc__ uses our own implementations of all arithmetic/comparison operators, so I would need to list here everything that all those different ops accept, which might be a bit annoying.
I doubt you need to list types dependent on the operators -- it's probably enough to list the types that can be used in valid operations with pandas.Series.
It's true that you would possibly need to recurse into extension arrays to figure out all the valid types, and it may not be easy to enumerate all scalar types, e.g., if IPAddressExtensionArray can handle arithmetic with IPAddress, how could pandas.Series know about that?
Yes, I think it is mainly the scalar values which might be hard to list (apart from those, we would probably accept Series, Index, array, and a bunch of array-likes (list, tuple, ...) for the Series case).
But e.g. a Series with datetime-like values can already handle our own Timestamp, Timedelta and offset objects, numpy scalars, datetime standard library scalars, integers, ...; as I understand it, those should all be listed here?
(But anyhow, if we implement an __array_ufunc__ for e.g. PeriodArray, we would need to do that there as well, so we need to make this list in any case.)
But what you mention about custom ExtensionArrays is yet another problem: we don't know what they would accept. By "recurse into extension arrays", do you mean something like checking the ExtensionArray._HANDLED_TYPES (and then expecting EA authors to set this specific property)? Or some other way to inspect?
> By "recurse into extension arrays", would you mean something like checking the ExtensionArray._HANDLED_TYPES (and then expecting EA authors to set this specific property)? Or some other way to inspect?

Yeah, that's about the best I can imagine. But I agree, it doesn't seem easy! There's no general way to detect what types another object's __array_ufunc__ method can handle without actually calling it.
It is probably reasonable to trade off correctness for generality here, but we should be cognizant that we are making this choice. Alternatively, we might maintain an explicit "blacklist" of types which pandas can't handle, though this gets tricky with non-required imports (e.g., dask and xarray objects).
@shoyer in light of the above, how would you define the handled types for an object-dtype Series, into which you can in principle put any Python object you want? For numpy arrays, things like addition work element-wise for object dtype, so the following works:
In [101]: class A():
...: def __init__(self, x):
...: self.x = x
...: def __add__(self, other):
...: return A(self.x + other.x)
...: def __repr__(self):
...: return "A(x={})".format(self.x)
...:
In [102]: a = np.array([A(1), A(2)], dtype=object)
In [103]: np.add(a, A(3))
Out[103]: array([A(x=4), A(x=5)], dtype=object)
In [104]: np.add(pd.Series(a), A(3))
Out[104]:
0 A(x=4)
1 A(x=5)
dtype: object
If we do a stricter check for allowed types, the above will no longer work.
How is numpy actually dealing with the above? Does it make an exception for object dtype?
So two ideas that might be useful here:
- We can make an exception for object dtype, and in that case allow more unknown objects.
- We could check for objects that implement __array_ufunc__ themselves: if they have the attribute and are not known to us (not in our _HANDLED_TYPES list, e.g. dask or xarray objects), we return NotImplemented; otherwise we pass through.
That at least solves the problem for the better-known objects that will probably implement the interface, but of course not yet for custom container types that do not.
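A sketch combining the two ideas (the should_defer helper and FakeDaskArray class are hypothetical, for illustration only): unknown objects that implement __array_ufunc__ themselves trigger a deferral, while other unknown scalars are tolerated when the data is object dtype.

```python
import numpy as np

class FakeDaskArray:
    """Stand-in for an unknown container that implements the protocol."""
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        return NotImplemented

def should_defer(inputs, handled_types, is_object_dtype):
    """Return True when __array_ufunc__ should return NotImplemented."""
    for x in inputs:
        if isinstance(x, handled_types):
            continue
        # Idea 2: unknown objects implementing __array_ufunc__ themselves
        # (e.g. dask or xarray objects) get the chance to handle the op.
        if hasattr(type(x), '__array_ufunc__'):
            return True
        # Idea 1: for object dtype, tolerate arbitrary unknown scalars.
        if not is_object_dtype:
            return True
    return False

handled = (np.ndarray, int)
defer_known = should_defer((np.arange(3), 2), handled, False)
defer_container = should_defer((np.arange(3), FakeDaskArray()), handled, False)
defer_obj_scalar = should_defer((np.arange(3), object()), handled, True)
```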
These are good questions. In principle NumPy follows the rules described at http://www.numpy.org/neps/nep-0013-ufunc-overrides.html#behavior-in-combination-with-python-s-binary-operations but I think there is something missing for unknown scalars / object arrays...
See numpy/numpy#12258 for a description of how NumPy does things.
To make sure I understand this thread: @shoyer kicked it off by saying

> Normally we recommend doing a type check and returning NotImplemented if you see an unknown type

We can handle the scalar types of EAs by requiring that they implement a _HANDLED_TYPES, which would include their scalar type. Then if all the (unboxed from Series / Index) inputs are in the handled types, we proceed.
object dtype complicates things. Did we decide what to do there? Skip the check?
@shoyer Thanks for the feedback!
Question: do we want ufuncs to work on IntegerArray for older numpy versions (that don't yet support …)? @TomAugspurger for sparse, you did this I suppose by defining … (see pandas/core/arrays/sparse.py, lines 1433 to 1439 at f928b46),
but that seems a bit like a hack? (and I also don't fully understand it; this will compute the ufunc twice, once on the densified data, before the result of that is passed to …)
It is indeed a hack, since it converts the data to dense. I'm not sure why the ufunc would be computed twice though.
Because it is the result of the ufunc that is passed to …
Ah, I thought the input was passed... Will take a look.
@shoyer that would also solve what we discussed before about not knowing, for object dtype, what the handled types are? (currently we are skipping the type check for object dtype data)
doc/source/development/extending.rst (Outdated)

       array can handle
    2. Defer to the :class:`Series` implementation by returning ``NotImplemented``
       if there are any :class:`Series` in the ``types``. This ensures consistent
       metadata handling, and associativity for binary operations.
This is in general the case for pandas containers, I think? (At least in the future, if we add the ufunc protocol to Index and DataFrame.)
So I would make this more general than only Series: the _HANDLED_TYPES should not include any pandas container types (Series, Index, DataFrame).
    @@ -874,6 +875,7 @@ ExtensionArray

    - Bug in :func:`factorize` when passing an ``ExtensionArray`` with a custom ``na_sentinel`` (:issue:`25696`).
    - :meth:`Series.count` miscounts NA values in ExtensionArrays (:issue:`26835`)
    - Added ``Series.__array_ufunc__`` to better handle NumPy ufuncs applied to Series backed by extension arrays (:issue:`23293`).
Should there be a general item for Series.__array_ufunc__ being added? Although in principle it should not change any behaviour, as we already supported ufuncs?
pandas/core/arrays/integer.py (Outdated)

        def reconstruct(x):
            if np.isscalar(x):
                # reductions.
                if mask.any():
This 'scalar' branch — is it still relevant if we raise above for 'reduce' methods? (I don't think any of the non-reduce ufuncs can return a scalar.)
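For context, a simplified sketch of what such a reconstruct-style helper does for masked integer data (the masked_ufunc helper is hypothetical, not the PR's actual code): run the ufunc on the raw values, propagate the mask, and treat a scalar result as the reduction case.

```python
import numpy as np

def masked_ufunc(ufunc, data, mask):
    """Apply a one-output ufunc to masked integer data (simplified sketch)."""
    result = ufunc(data)
    if np.isscalar(result):
        # reduction-like case: if any input value was masked, the result is NA
        return None if mask.any() else result
    # element-wise case: propagate the mask alongside the new values
    return result, mask.copy()

data = np.array([1, 4, 9])
mask = np.array([False, True, False])
values, result_mask = masked_ufunc(np.sqrt, data, mask)
```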
I think this is good enough to put in the RC
Yes, I think this would be an elegant way to handle this. Here's what I wrote up last October: … It's up to you whether you want to check for …
Ah, thanks for the link to that issue. Forgot about that discussion.
some small comments; ok to merge for the rc
doc/source/development/extending.rst (Outdated)

    As part of your implementation, we require that you

    1. Define a ``_HANDLED_TYPES`` attribute, a tuple, containing the types your
are we now recommending setting __array_priority__?
pandas/core/series.py (Outdated)

        handled_types = sum([getattr(x, '_HANDLED_TYPES', ()) for x in inputs],
                            self._HANDLED_TYPES + (Series,))
        any_object = any(getattr(x, 'dtype', None) == 'object' for x in inputs)
        # defer when an unknown object and not object dtype.
blank line before comments
I'll update to incorporate #23293 (comment) later today. Fixing the CI for pandas and dask first.
also, is there any way we can pin numpy 1.17rc1 in one of the builds just to be sure (as I saw some other breakages on this), even though we test master.
I tried to implement the array_priority / array_ufunc check. A bit rushed today so I may have messed something up. The main wrinkle is that …
If I am understanding correctly, we should only check for the presence of …
Ah, OK I missed that in the discussion. Just to make sure, we only just …
FYI, all green here.
thanks @jorisvandenbossche and @TomAugspurger very nice. would be even nicer to shake this down a bit in master......
any open issues for this?
#22798 is related, but need to check if that can be fully closed.
It might be that we should still check the _HANDLED_TYPES of the underlying EAs.
On the scalar __array_priority__, we should be OK as long as it's lower than Series.__array_priority__?
On Tue, Jul 2, 2019 at 9:06 AM Joris Van den Bossche wrote:

> any open issues for this?
> #22798 is related, but need to check if that can be fully closed.
> Ah, OK I missed that in the discussion. Just to make sure, we only check Series._HANDLED_TYPES? We don't do the recursive getattr(input, '_HANDLED_TYPES', ())?
> It *might* be that we should still check the _HANDLED_TYPES of underlying EAs. Some EAs will want to handle scalars that have a __array_priority__ set (like our own scalars), although this is maybe a sign that we should not define __array_priority__ on them ...
It seems we use 1000 for Series and EAs, and 100 for Timestamp and Timedelta, so that should be OK then?
That's my understanding.
This PR:
- __array_ufunc__ implementation for IntegerArray (to be able to check it is correctly used from Series for ExtensionArrays)
- Series.__array_ufunc__ (not yet for Index)

@TomAugspurger revived my branch, will try to work on it a bit further this afternoon
What this already does is a basic implementation of the protocol for IntegerArray for simple ufuncs (call) for all IntegerArrays, and Series dispatching to the underlying values.
One question is if we want to force EA authors to implement an __array_ufunc__ (eg by having a default implementation returning NotImplemented).